{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Pff8TULfBfmW"
      },
      "source": [
        "# MongoDB Atlas Vector Search with VoyageAI Embeddings for Sports Scores and Stories\n",
        "\n",
        "This notebook demonstrates how to use VoyageAI embeddings with MongoDB Atlas Vector Search for retrieving relevant sports scores and stories based on user queries."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nFlj2GR6BfmX"
      },
      "source": [
        "## Overview\n",
        "\n",
        "In this tutorial, we'll learn how to:\n",
        "\n",
        "1. Connect to MongoDB Atlas and retrieve sports data\n",
        "2. Generate embeddings using VoyageAI's embedding models\n",
        "3. Store these embeddings in MongoDB\n",
        "4. Create and use a vector search index for semantic similarity search\n",
        "5. Use hybrid search for result tuning.\n",
        "6. Implement a RAG (Retrieval-Augmented Generation) system to answer questions about sports teams and matches\n",
        "7. Showing how Agentic rag changes the results by using hybrid search as tools for an ai-agent built with the openai-agent sdk.\n",
        "\n",
        "This approach combines the power of vector embeddings with natural language processing to provide relevant sports information based on user queries."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Bv3ypa32BfmY"
      },
      "source": [
        "## Setup and Configuration\n",
        "\n",
        "First, let's import the necessary libraries and set up our environment. We'll need libraries for data manipulation, machine learning, visualization, and MongoDB connectivity."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "x-zn2F9dBfmY",
        "outputId": "12c58d0a-f4c1-4d1c-928e-92c75fb0c20d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Collecting voyageai\n",
            "  Downloading voyageai-0.3.2-py3-none-any.whl.metadata (2.6 kB)\n",
            "Collecting pymongo\n",
            "  Downloading pymongo-4.11.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (2.0.2)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (2.2.2)\n",
            "Requirement already satisfied: matplotlib in /usr/local/lib/python3.11/dist-packages (3.10.0)\n",
            "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.11/dist-packages (1.6.1)\n",
            "Requirement already satisfied: python-dotenv in /usr/local/lib/python3.11/dist-packages (1.1.0)\n",
            "Requirement already satisfied: openai in /usr/local/lib/python3.11/dist-packages (1.68.2)\n",
            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.11/dist-packages (from voyageai) (3.11.14)\n",
            "Collecting aiolimiter (from voyageai)\n",
            "  Downloading aiolimiter-1.2.1-py3-none-any.whl.metadata (4.5 kB)\n",
            "Requirement already satisfied: pillow in /usr/local/lib/python3.11/dist-packages (from voyageai) (11.1.0)\n",
            "Requirement already satisfied: pydantic>=1.10.8 in /usr/local/lib/python3.11/dist-packages (from voyageai) (2.10.6)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from voyageai) (2.32.3)\n",
            "Requirement already satisfied: tenacity in /usr/local/lib/python3.11/dist-packages (from voyageai) (9.0.0)\n",
            "Requirement already satisfied: tokenizers>=0.14.0 in /usr/local/lib/python3.11/dist-packages (from voyageai) (0.21.1)\n",
            "Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)\n",
            "  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)\n",
            "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas) (2025.1)\n",
            "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas) (2025.1)\n",
            "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (1.3.1)\n",
            "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (0.12.1)\n",
            "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (4.56.0)\n",
            "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (1.4.8)\n",
            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (24.2)\n",
            "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (3.2.1)\n",
            "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.14.1)\n",
            "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.4.2)\n",
            "Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (3.6.0)\n",
            "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.11/dist-packages (from openai) (4.9.0)\n",
            "Requirement already satisfied: distro<2,>=1.7.0 in /usr/local/lib/python3.11/dist-packages (from openai) (1.9.0)\n",
            "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from openai) (0.28.1)\n",
            "Requirement already satisfied: jiter<1,>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from openai) (0.9.0)\n",
            "Requirement already satisfied: sniffio in /usr/local/lib/python3.11/dist-packages (from openai) (1.3.1)\n",
            "Requirement already satisfied: tqdm>4 in /usr/local/lib/python3.11/dist-packages (from openai) (4.67.1)\n",
            "Requirement already satisfied: typing-extensions<5,>=4.11 in /usr/local/lib/python3.11/dist-packages (from openai) (4.12.2)\n",
            "Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.11/dist-packages (from anyio<5,>=3.5.0->openai) (3.10)\n",
            "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpx<1,>=0.23.0->openai) (2025.1.31)\n",
            "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.11/dist-packages (from httpx<1,>=0.23.0->openai) (1.0.7)\n",
            "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.11/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.14.0)\n",
            "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=1.10.8->voyageai) (0.7.0)\n",
            "Requirement already satisfied: pydantic-core==2.27.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=1.10.8->voyageai) (2.27.2)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n",
            "Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.11/dist-packages (from tokenizers>=0.14.0->voyageai) (0.29.3)\n",
            "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->voyageai) (2.6.1)\n",
            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.11/dist-packages (from aiohttp->voyageai) (1.3.2)\n",
            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->voyageai) (25.3.0)\n",
            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.11/dist-packages (from aiohttp->voyageai) (1.5.0)\n",
            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.11/dist-packages (from aiohttp->voyageai) (6.2.0)\n",
            "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->voyageai) (0.3.0)\n",
            "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->voyageai) (1.18.3)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->voyageai) (3.4.1)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->voyageai) (2.3.0)\n",
            "Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers>=0.14.0->voyageai) (3.18.0)\n",
            "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers>=0.14.0->voyageai) (2025.3.0)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers>=0.14.0->voyageai) (6.0.2)\n",
            "Downloading voyageai-0.3.2-py3-none-any.whl (25 kB)\n",
            "Downloading pymongo-4.11.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.4/1.4 MB\u001b[0m \u001b[31m26.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading dnspython-2.7.0-py3-none-any.whl (313 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m313.6/313.6 kB\u001b[0m \u001b[31m20.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading aiolimiter-1.2.1-py3-none-any.whl (6.7 kB)\n",
            "Installing collected packages: dnspython, aiolimiter, pymongo, voyageai\n",
            "Successfully installed aiolimiter-1.2.1 dnspython-2.7.0 pymongo-4.11.3 voyageai-0.3.2\n"
          ]
        }
      ],
      "source": [
        "%pip install voyageai pymongo   scikit-learn python-dotenv openai"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "iUxNlwccBfmY",
        "outputId": "60b3ec1e-8cbe-417b-eeb3-56e46b848043"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "False"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import logging\n",
        "import os\n",
        "from datetime import datetime, timedelta\n",
        "\n",
        "import voyageai\n",
        "from dotenv import load_dotenv\n",
        "from openai import OpenAI\n",
        "from pymongo import MongoClient\n",
        "\n",
        "# Set up logging\n",
        "logging.basicConfig(\n",
        "    level=logging.INFO, format=\"%(asctime)s - %(levelname)s - %(message)s\"\n",
        ")\n",
        "\n",
        "# Load environment variables\n",
        "load_dotenv()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VhOZWmjCBfmY"
      },
      "source": [
        "### Environment Variables\n",
        "\n",
        "We'll use environment variables to store sensitive information like API keys and connection strings. These should be stored in a `.env` file in the same directory as this notebook.\n",
        "\n",
        "Example `.env` file content:\n",
        "```\n",
        "MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/\n",
        "VOYAGE_API_KEY=your_voyage_api_key_here\n",
        "OPENAI_API_KEY=your_openai_api_key_here\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lQHVhbeOBfmY",
        "outputId": "05be8e3f-74a4-4272-9e8d-eb5a4b6f7b4d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Enter your MongoDB connection string: ··········\n",
            "Enter your VoyageAI API key: ··········\n",
            "Enter your OpenAI API key: ··········\n",
            "Environment variables loaded successfully\n"
          ]
        }
      ],
      "source": [
        "# MongoDB connection string\n",
        "import getpass\n",
        "\n",
        "MONGODB_URI = getpass.getpass(\"Enter your MongoDB connection string: \")\n",
        "# VoyageAI API key for embeddings\n",
        "VOYAGE_API_KEY = getpass.getpass(\"Enter your VoyageAI API key: \")\n",
        "# OpenAI API key for RAG\n",
        "OPENAI_API_KEY = getpass.getpass(\"Enter your OpenAI API key: \")\n",
        "\n",
        "\n",
        "# Check if environment variables are set\n",
        "if not MONGODB_URI or not VOYAGE_API_KEY or not OPENAI_API_KEY:\n",
        "    print(\n",
        "        \"Error: Environment variables MONGODB_URI, VOYAGE_API_KEY, and OPENAI_API_KEY must be set\"\n",
        "    )\n",
        "    print(\"Please create a .env file with these variables\")\n",
        "else:\n",
        "    print(\"Environment variables loaded successfully\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VU_EOcrPBfmY"
      },
      "source": [
        "### MongoDB Configuration\n",
        "\n",
        "Now let's set up our MongoDB connection and define the database and collections we'll be using."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jmpMJ-dUBfmZ",
        "outputId": "8f6d94ff-5543-4830-9df8-16acd129190f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "MongoDB connection successful\n"
          ]
        }
      ],
      "source": [
        "# MongoDB configuration\n",
        "DB_NAME = \"sports_demo\"\n",
        "COLLECTION_NAME = \"matches\"\n",
        "TEAMS_COLLECTION = \"teams\"\n",
        "NEWS_COLLECTION = \"news\"\n",
        "VECTOR_COLLECTION = \"vector_features\"\n",
        "ATLAS_VECTOR_SEARCH_INDEX_NAME = \"voyage_vector_index\"\n",
        "\n",
        "# Initialize MongoDB client\n",
        "client = MongoClient(MONGODB_URI, appname=\"voyageai.mongodb.sports_scores_demo\")\n",
        "\n",
        "# Access collections\n",
        "matches_collection = client[DB_NAME][COLLECTION_NAME]\n",
        "teams_collection = client[DB_NAME][TEAMS_COLLECTION]\n",
        "news_collection = client[DB_NAME][NEWS_COLLECTION]\n",
        "vector_collection = client[DB_NAME][VECTOR_COLLECTION]\n",
        "\n",
        "# Test the connection\n",
        "try:\n",
        "    # The ismaster command is cheap and does not require auth\n",
        "    client.admin.command(\"ismaster\")\n",
        "    print(\"MongoDB connection successful\")\n",
        "except Exception as e:\n",
        "    print(f\"MongoDB connection failed: {e}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IdoEexV0BfmZ"
      },
      "source": [
        "## VoyageAI Embeddings\n",
        "\n",
        "Next, we'll create a class to handle generating embeddings using VoyageAI's API. Embeddings are vector representations of text that capture semantic meaning, allowing us to perform operations like similarity search."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "thuabhFlBfmZ"
      },
      "outputs": [],
      "source": [
        "class VoyageAIEmbeddings:\n",
        "    \"\"\"Custom VoyageAI embeddings class\"\"\"\n",
        "\n",
        "    def __init__(self, api_key, model=\"voyage-3\"):\n",
        "        self.api_key = api_key\n",
        "        self.model = model\n",
        "        os.environ[\"VOYAGE_API_KEY\"] = api_key\n",
        "        self.client = voyageai.Client(api_key=api_key)\n",
        "\n",
        "    def embed_text(self, text):\n",
        "        \"\"\"Embed a single text using VoyageAI\"\"\"\n",
        "        response = self.client.embed([text], model=self.model, input_type=\"document\")\n",
        "        return response.embeddings[0]\n",
        "\n",
        "    def embed_batch(self, texts, batch_size=20):\n",
        "        \"\"\"Embed a batch of texts efficiently\"\"\"\n",
        "        embeddings = []\n",
        "        for i in range(0, len(texts), batch_size):\n",
        "            batch = texts[i : i + batch_size]\n",
        "            response = self.client.embed(batch, model=self.model, input_type=\"document\")\n",
        "            embeddings.extend(response.embeddings)\n",
        "            print(f\"Processed {i+len(batch)}/{len(texts)} embeddings\")\n",
        "        return embeddings"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_mOs2FXvBfmZ"
      },
      "source": [
        "### Understanding Embeddings\n",
        "\n",
        "Embeddings are dense vector representations of text that capture semantic meaning. The VoyageAI model we're using (`voyage-3`) generates 1024-dimensional vectors for each text input. These vectors have several important properties:\n",
        "\n",
        "1. **Semantic similarity**: Texts with similar meanings will have embeddings that are close to each other in the vector space\n",
        "2. **Dimensionality**: The high-dimensional space allows for capturing complex relationships between concepts\n",
        "3. **Language understanding**: The model has been trained on vast amounts of text data to understand language nuances\n",
        "\n",
        "In our case, we'll use these embeddings to represent sports data in a way that captures the semantic meaning of team names, match descriptions, and news stories."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yBpBHtSPBfmZ"
      },
      "source": [
        "## Sample Data Generation\n",
        "\n",
        "For demonstration purposes, let's create some sample sports data. In a real-world scenario, this data would come from an API or another data source."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Wh-p5KVFBfmZ",
        "outputId": "97fbf071-3027-4e29-a8e4-0a8d5637a617"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Generating sample sports data...\n",
            "Inserted 15 teams, 7 matches, and 5 news stories\n"
          ]
        }
      ],
      "source": [
        "def generate_sample_data():\n",
        "    \"\"\"Generate sample sports data for demonstration purposes\"\"\"\n",
        "    print(\"Generating sample sports data...\")\n",
        "\n",
        "    # Sample teams with nicknames\n",
        "    teams = [\n",
        "        {\n",
        "            \"team_id\": \"MNU\",\n",
        "            \"name\": \"Manchester United\",\n",
        "            \"nicknames\": [\"Red Devils\", \"United\"],\n",
        "            \"league\": \"Premier League\",\n",
        "            \"country\": \"England\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"MNC\",\n",
        "            \"name\": \"Manchester City\",\n",
        "            \"nicknames\": [\"Citizens\", \"City\"],\n",
        "            \"league\": \"Premier League\",\n",
        "            \"country\": \"England\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"LIV\",\n",
        "            \"name\": \"Liverpool\",\n",
        "            \"nicknames\": [\"Reds\", \"The Kop\"],\n",
        "            \"league\": \"Premier League\",\n",
        "            \"country\": \"England\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"CHE\",\n",
        "            \"name\": \"Chelsea\",\n",
        "            \"nicknames\": [\"Blues\", \"The Pensioners\"],\n",
        "            \"league\": \"Premier League\",\n",
        "            \"country\": \"England\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"ARS\",\n",
        "            \"name\": \"Arsenal\",\n",
        "            \"nicknames\": [\"Gunners\", \"The Arsenal\"],\n",
        "            \"league\": \"Premier League\",\n",
        "            \"country\": \"England\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"TOT\",\n",
        "            \"name\": \"Tottenham Hotspur\",\n",
        "            \"nicknames\": [\"Spurs\", \"Lilywhites\"],\n",
        "            \"league\": \"Premier League\",\n",
        "            \"country\": \"England\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"BAR\",\n",
        "            \"name\": \"Barcelona\",\n",
        "            \"nicknames\": [\"Barça\", \"Blaugrana\"],\n",
        "            \"league\": \"La Liga\",\n",
        "            \"country\": \"Spain\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"RMA\",\n",
        "            \"name\": \"Real Madrid\",\n",
        "            \"nicknames\": [\"Los Blancos\", \"Merengues\"],\n",
        "            \"league\": \"La Liga\",\n",
        "            \"country\": \"Spain\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"ATM\",\n",
        "            \"name\": \"Atletico Madrid\",\n",
        "            \"nicknames\": [\"Atleti\", \"Colchoneros\"],\n",
        "            \"league\": \"La Liga\",\n",
        "            \"country\": \"Spain\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"BAY\",\n",
        "            \"name\": \"Bayern Munich\",\n",
        "            \"nicknames\": [\"Die Roten\", \"Bavarians\"],\n",
        "            \"league\": \"Bundesliga\",\n",
        "            \"country\": \"Germany\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"BVB\",\n",
        "            \"name\": \"Borussia Dortmund\",\n",
        "            \"nicknames\": [\"BVB\", \"Die Schwarzgelben\"],\n",
        "            \"league\": \"Bundesliga\",\n",
        "            \"country\": \"Germany\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"JUV\",\n",
        "            \"name\": \"Juventus\",\n",
        "            \"nicknames\": [\"Old Lady\", \"Bianconeri\"],\n",
        "            \"league\": \"Serie A\",\n",
        "            \"country\": \"Italy\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"INT\",\n",
        "            \"name\": \"Inter Milan\",\n",
        "            \"nicknames\": [\"Nerazzurri\", \"La Beneamata\"],\n",
        "            \"league\": \"Serie A\",\n",
        "            \"country\": \"Italy\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"ACM\",\n",
        "            \"name\": \"AC Milan\",\n",
        "            \"nicknames\": [\"Rossoneri\", \"Diavolo\"],\n",
        "            \"league\": \"Serie A\",\n",
        "            \"country\": \"Italy\",\n",
        "        },\n",
        "        {\n",
        "            \"team_id\": \"PSG\",\n",
        "            \"name\": \"Paris Saint-Germain\",\n",
        "            \"nicknames\": [\"Les Parisiens\", \"PSG\"],\n",
        "            \"league\": \"Ligue 1\",\n",
        "            \"country\": \"France\",\n",
        "        },\n",
        "    ]\n",
        "\n",
        "    # Generate sample matches (recent results)\n",
        "    now = datetime.now()\n",
        "    matches = []\n",
        "\n",
        "    # Premier League matches\n",
        "    matches.extend(\n",
        "        [\n",
        "            {\n",
        "                \"match_id\": \"PL2023-001\",\n",
        "                \"home_team\": \"MNU\",\n",
        "                \"away_team\": \"LIV\",\n",
        "                \"home_score\": 2,\n",
        "                \"away_score\": 1,\n",
        "                \"date\": (now - timedelta(days=2)).strftime(\"%Y-%m-%d\"),\n",
        "                \"competition\": \"Premier League\",\n",
        "                \"season\": \"2023-2024\",\n",
        "                \"stadium\": \"Old Trafford\",\n",
        "                \"summary\": \"Manchester United secured a thrilling 2-1 victory over Liverpool at Old Trafford. Bruno Fernandes opened the scoring with a penalty in the 34th minute, before Marcus Rashford doubled the lead with a brilliant solo effort in the 67th minute. Mohamed Salah pulled one back for Liverpool in the 85th minute, but United held on for a crucial win.\",\n",
        "            },\n",
        "            {\n",
        "                \"match_id\": \"PL2023-002\",\n",
        "                \"home_team\": \"ARS\",\n",
        "                \"away_team\": \"MNC\",\n",
        "                \"home_score\": 1,\n",
        "                \"away_score\": 1,\n",
        "                \"date\": (now - timedelta(days=3)).strftime(\"%Y-%m-%d\"),\n",
        "                \"competition\": \"Premier League\",\n",
        "                \"season\": \"2023-2024\",\n",
        "                \"stadium\": \"Emirates Stadium\",\n",
        "                \"summary\": \"Arsenal and Manchester City played out an entertaining 1-1 draw at the Emirates Stadium. Erling Haaland gave City the lead in the 23rd minute with a powerful header, but Bukayo Saka equalized for the Gunners in the 59th minute with a well-placed shot from the edge of the box.\",\n",
        "            },\n",
        "            {\n",
        "                \"match_id\": \"PL2023-003\",\n",
        "                \"home_team\": \"CHE\",\n",
        "                \"away_team\": \"TOT\",\n",
        "                \"home_score\": 3,\n",
        "                \"away_score\": 0,\n",
        "                \"date\": (now - timedelta(days=1)).strftime(\"%Y-%m-%d\"),\n",
        "                \"competition\": \"Premier League\",\n",
        "                \"season\": \"2023-2024\",\n",
        "                \"stadium\": \"Stamford Bridge\",\n",
        "                \"summary\": \"Chelsea dominated Tottenham in a 3-0 London derby win at Stamford Bridge. Cole Palmer scored twice in the first half, and Nicolas Jackson added a third in the 78th minute to complete the rout. Spurs struggled to create chances throughout the match.\",\n",
        "            },\n",
        "        ]\n",
        "    )\n",
        "\n",
        "    # La Liga matches\n",
        "    matches.extend(\n",
        "        [\n",
        "            {\n",
        "                \"match_id\": \"LL2023-001\",\n",
        "                \"home_team\": \"BAR\",\n",
        "                \"away_team\": \"RMA\",\n",
        "                \"home_score\": 3,\n",
        "                \"away_score\": 2,\n",
        "                \"date\": (now - timedelta(days=4)).strftime(\"%Y-%m-%d\"),\n",
        "                \"competition\": \"La Liga\",\n",
        "                \"season\": \"2023-2024\",\n",
        "                \"stadium\": \"Camp Nou\",\n",
        "                \"summary\": \"Barcelona edged Real Madrid 3-2 in an exciting El Clásico at Camp Nou. Robert Lewandowski scored twice for Barça, while Lamine Yamal added another. Vinícius Júnior and Jude Bellingham scored for Real Madrid, but it wasn't enough to prevent defeat.\",\n",
        "            },\n",
        "            {\n",
        "                \"match_id\": \"LL2023-002\",\n",
        "                \"home_team\": \"ATM\",\n",
        "                \"away_team\": \"BAR\",\n",
        "                \"home_score\": 1,\n",
        "                \"away_score\": 2,\n",
        "                \"date\": (now - timedelta(days=11)).strftime(\"%Y-%m-%d\"),\n",
        "                \"competition\": \"La Liga\",\n",
        "                \"season\": \"2023-2024\",\n",
        "                \"stadium\": \"Metropolitano\",\n",
        "                \"summary\": \"Barcelona came from behind to beat Atletico Madrid 2-1 at the Metropolitano. Antoine Griezmann gave Atletico the lead in the first half, but goals from Pedri and Robert Lewandowski in the second half secured the win for Barcelona.\",\n",
        "            },\n",
        "        ]\n",
        "    )\n",
        "\n",
        "    # Other league matches\n",
        "    matches.extend(\n",
        "        [\n",
        "            {\n",
        "                \"match_id\": \"BL2023-001\",\n",
        "                \"home_team\": \"BAY\",\n",
        "                \"away_team\": \"BVB\",\n",
        "                \"home_score\": 4,\n",
        "                \"away_score\": 0,\n",
        "                \"date\": (now - timedelta(days=5)).strftime(\"%Y-%m-%d\"),\n",
        "                \"competition\": \"Bundesliga\",\n",
        "                \"season\": \"2023-2024\",\n",
        "                \"stadium\": \"Allianz Arena\",\n",
        "                \"summary\": \"Bayern Munich thrashed Borussia Dortmund 4-0 in Der Klassiker at the Allianz Arena. Harry Kane scored a hat-trick, while Leroy Sané added another as Bayern dominated from start to finish.\",\n",
        "            },\n",
        "            {\n",
        "                \"match_id\": \"SA2023-001\",\n",
        "                \"home_team\": \"JUV\",\n",
        "                \"away_team\": \"INT\",\n",
        "                \"home_score\": 1,\n",
        "                \"away_score\": 1,\n",
        "                \"date\": (now - timedelta(days=6)).strftime(\"%Y-%m-%d\"),\n",
        "                \"competition\": \"Serie A\",\n",
        "                \"season\": \"2023-2024\",\n",
        "                \"stadium\": \"Allianz Stadium\",\n",
        "                \"summary\": \"Juventus and Inter Milan shared the points in a 1-1 draw in the Derby d'Italia. Dusan Vlahovic put Juventus ahead in the first half, but Lautaro Martínez equalized for Inter in the second half.\",\n",
        "            },\n",
        "        ]\n",
        "    )\n",
        "\n",
        "    # Generate sample news stories\n",
        "    news = [\n",
        "        {\n",
        "            \"news_id\": \"NEWS001\",\n",
        "            \"title\": \"Manchester United's Bruno Fernandes wins Player of the Month\",\n",
        "            \"date\": (now - timedelta(days=1)).strftime(\"%Y-%m-%d\"),\n",
        "            \"content\": \"Manchester United captain Bruno Fernandes has been named Premier League Player of the Month for his outstanding performances. The Portuguese midfielder scored 4 goals and provided 3 assists in 5 matches, helping United climb up the table. This is Fernandes' 5th Player of the Month award since joining United in January 2020.\",\n",
        "            \"teams\": [\"MNU\"],\n",
        "            \"players\": [\"Bruno Fernandes\"],\n",
        "            \"category\": \"Award\",\n",
        "        },\n",
        "        {\n",
        "            \"news_id\": \"NEWS002\",\n",
        "            \"title\": \"Liverpool suffer injury blow as Salah ruled out for three weeks\",\n",
        "            \"date\": now.strftime(\"%Y-%m-%d\"),\n",
        "            \"content\": \"Liverpool have been dealt a major injury blow with the news that Mohamed Salah will be sidelined for three weeks with a hamstring strain. The Egyptian forward picked up the injury during Liverpool's 2-1 defeat to Manchester United and is expected to miss crucial matches against Arsenal and Manchester City. Manager Jürgen Klopp described the injury as 'unfortunate timing' as Liverpool enter a busy period of fixtures.\",\n",
        "            \"teams\": [\"LIV\", \"MNU\"],\n",
        "            \"players\": [\"Mohamed Salah\"],\n",
        "            \"category\": \"Injury\",\n",
        "        },\n",
        "        {\n",
        "            \"news_id\": \"NEWS003\",\n",
        "            \"title\": \"Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer\",\n",
        "            \"date\": (now - timedelta(days=4)).strftime(\"%Y-%m-%d\"),\n",
        "            \"content\": \"Barcelona wonderkid Lamine Yamal has made history by becoming the youngest ever goalscorer in El Clásico at just 16 years and 107 days old. The Spanish teenager scored a spectacular long-range goal in Barcelona's 3-2 victory over Real Madrid at Camp Nou. 'It's a dream come true,' said Yamal after the match. 'I've been watching El Clásico since I was a child, and to score in this fixture is incredible.'\",\n",
        "            \"teams\": [\"BAR\", \"RMA\"],\n",
        "            \"players\": [\"Lamine Yamal\"],\n",
        "            \"category\": \"Record\",\n",
        "        },\n",
        "        {\n",
        "            \"news_id\": \"NEWS004\",\n",
        "            \"title\": \"Manchester City's Erling Haaland on track to break Premier League scoring record\",\n",
        "            \"date\": (now - timedelta(days=2)).strftime(\"%Y-%m-%d\"),\n",
        "            \"content\": \"Manchester City striker Erling Haaland is on course to break his own Premier League scoring record this season. The Norwegian has already netted 15 goals in just 10 matches, putting him ahead of his record-breaking pace from last season when he scored 36 goals. Pep Guardiola praised Haaland's incredible form: 'What he's doing is remarkable. His hunger for goals is insatiable.'\",\n",
        "            \"teams\": [\"MNC\"],\n",
        "            \"players\": [\"Erling Haaland\"],\n",
        "            \"category\": \"Performance\",\n",
        "        },\n",
        "        {\n",
        "            \"news_id\": \"NEWS005\",\n",
        "            \"title\": \"Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker\",\n",
        "            \"date\": (now - timedelta(days=5)).strftime(\"%Y-%m-%d\"),\n",
        "            \"content\": \"Harry Kane scored a perfect hat-trick (right foot, left foot, header) as Bayern Munich demolished Borussia Dortmund 4-0 in Der Klassiker. The England captain has made a sensational start to his Bundesliga career since his summer move from Tottenham Hotspur. 'I'm loving my time here in Munich,' said Kane. 'The team is incredible and we're playing some fantastic football.'\",\n",
        "            \"teams\": [\"BAY\", \"BVB\"],\n",
        "            \"players\": [\"Harry Kane\"],\n",
        "            \"category\": \"Performance\",\n",
        "        },\n",
        "    ]\n",
        "\n",
        "    # Clear existing data\n",
        "    teams_collection.delete_many({})\n",
        "    matches_collection.delete_many({})\n",
        "    news_collection.delete_many({})\n",
        "\n",
        "    # Insert sample data\n",
        "    teams_collection.insert_many(teams)\n",
        "    matches_collection.insert_many(matches)\n",
        "    news_collection.insert_many(news)\n",
        "\n",
        "    print(\n",
        "        f\"Inserted {len(teams)} teams, {len(matches)} matches, and {len(news)} news stories\"\n",
        "    )\n",
        "\n",
        "    return teams, matches, news\n",
        "\n",
        "\n",
        "# Generate sample data\n",
        "teams, matches, news = generate_sample_data()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dZ9WiBa1Bfma"
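      },
      "source": [
        "## Data Processing and Embedding Generation\n",
        "\n",
        "Next, we define two helpers. `generate_text_for_embedding` flattens a team, match, or news document into a single string, and `create_and_save_embeddings` embeds those strings with VoyageAI and stores each vector alongside its source document in the vector collection."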
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5SyubsVVBfma"
      },
      "outputs": [],
      "source": [
        "def generate_text_for_embedding(item, item_type):\n",
        "    \"\"\"Create a text representation for embedding based on the item type\"\"\"\n",
        "    if item_type == \"match\":\n",
        "        # Get team names for readability\n",
        "        home_team = next(\n",
        "            (team[\"name\"] for team in teams if team[\"team_id\"] == item[\"home_team\"]),\n",
        "            item[\"home_team\"],\n",
        "        )\n",
        "        away_team = next(\n",
        "            (team[\"name\"] for team in teams if team[\"team_id\"] == item[\"away_team\"]),\n",
        "            item[\"away_team\"],\n",
        "        )\n",
        "\n",
        "        text_parts = [\n",
        "            f\"Match: {home_team} vs {away_team}\",\n",
        "            f\"Score: {item['home_score']}-{item['away_score']}\",\n",
        "            f\"Competition: {item['competition']} {item['season']}\",\n",
        "            f\"Date: {item['date']}\",\n",
        "            f\"Stadium: {item['stadium']}\",\n",
        "            f\"Summary: {item['summary']}\",\n",
        "        ]\n",
        "        return \" \".join(text_parts)\n",
        "\n",
        "    elif item_type == \"team\":\n",
        "        text_parts = [\n",
        "            f\"Team: {item['name']}\",\n",
        "            f\"Also known as: {', '.join(item['nicknames'])}\",\n",
        "            f\"League: {item['league']}\",\n",
        "            f\"Country: {item['country']}\",\n",
        "        ]\n",
        "        return \" \".join(text_parts)\n",
        "\n",
        "    elif item_type == \"news\":\n",
        "        text_parts = [\n",
        "            f\"Title: {item['title']}\",\n",
        "            f\"Date: {item['date']}\",\n",
        "            f\"Category: {item['category']}\",\n",
        "            f\"Content: {item['content']}\",\n",
        "        ]\n",
        "        return \" \".join(text_parts)\n",
        "\n",
        "    return \"\"\n",
        "\n",
        "\n",
        "def create_and_save_embeddings():\n",
        "    \"\"\"Generate and save embeddings for all sports data\"\"\"\n",
        "    print(\"Generating embeddings for sports data...\")\n",
        "\n",
        "    # Initialize VoyageAI embeddings\n",
        "    voyage_embeddings = VoyageAIEmbeddings(api_key=VOYAGE_API_KEY)\n",
        "\n",
        "    # Clear existing vector data\n",
        "    vector_collection.delete_many({})\n",
        "\n",
        "    # Process teams\n",
        "    team_texts = [generate_text_for_embedding(team, \"team\") for team in teams]\n",
        "    team_embeddings = voyage_embeddings.embed_batch(team_texts)\n",
        "\n",
        "    # Process matches\n",
        "    match_texts = [generate_text_for_embedding(match, \"match\") for match in matches]\n",
        "    match_embeddings = voyage_embeddings.embed_batch(match_texts)\n",
        "\n",
        "    # Process news\n",
        "    news_texts = [generate_text_for_embedding(news_item, \"news\") for news_item in news]\n",
        "    news_embeddings = voyage_embeddings.embed_batch(news_texts)\n",
        "\n",
        "    # Create records with embeddings\n",
        "    vector_records = []\n",
        "\n",
        "    # Add team embeddings\n",
        "    for i, team in enumerate(teams):\n",
        "        vector_records.append(\n",
        "            {\n",
        "                \"object_id\": team[\"team_id\"],\n",
        "                \"object_type\": \"team\",\n",
        "                \"name\": team[\"name\"],\n",
        "                \"league\": team[\"league\"],\n",
        "                \"country\": team[\"country\"],\n",
        "                \"embedding\": team_embeddings[i],\n",
        "                \"data\": team,\n",
        "            }\n",
        "        )\n",
        "\n",
        "    # Add match embeddings\n",
        "    for i, match in enumerate(matches):\n",
        "        vector_records.append(\n",
        "            {\n",
        "                \"object_id\": match[\"match_id\"],\n",
        "                \"object_type\": \"match\",\n",
        "                \"home_team\": match[\"home_team\"],\n",
        "                \"away_team\": match[\"away_team\"],\n",
        "                \"competition\": match[\"competition\"],\n",
        "                \"date\": match[\"date\"],\n",
        "                \"embedding\": match_embeddings[i],\n",
        "                \"data\": match,\n",
        "            }\n",
        "        )\n",
        "\n",
        "    # Add news embeddings\n",
        "    for i, news_item in enumerate(news):\n",
        "        vector_records.append(\n",
        "            {\n",
        "                \"object_id\": news_item[\"news_id\"],\n",
        "                \"object_type\": \"news\",\n",
        "                \"title\": news_item[\"title\"],\n",
        "                \"date\": news_item[\"date\"],\n",
        "                \"category\": news_item[\"category\"],\n",
        "                \"embedding\": news_embeddings[i],\n",
        "                \"data\": news_item,\n",
        "            }\n",
        "        )\n",
        "\n",
        "    # Insert all records\n",
        "    vector_collection.insert_many(vector_records)\n",
        "    print(f\"Saved {len(vector_records)} embedding records to MongoDB\")\n",
        "\n",
        "    return vector_records"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "itwH31F_Bfma"
      },
      "outputs": [],
      "source": [
        "def create_vector_search_index():\n",
        "    \"\"\"Print instructions for creating a vector search index in MongoDB Atlas\"\"\"\n",
        "\n",
        "    print(\"Setting up Vector Search Index in MongoDB Atlas...\")\n",
        "    print(\"Note: To create the vector search index in MongoDB Atlas:\")\n",
        "    print(\"1. Go to the MongoDB Atlas dashboard\")\n",
        "    print(\"2. Select your cluster\")\n",
        "    print(\"3. Go to the 'Search' tab\")\n",
        "    print(\n",
        "        f\"4. Create a new index on '{VECTOR_COLLECTION}' with the following configuration:\"\n",
        "    )\n",
        "    print(\"\"\"\n",
        "{\n",
        "  \"fields\": [\n",
        "    {\n",
        "      \"type\": \"vector\",\n",
        "      \"path\": \"embedding\",\n",
        "      \"numDimensions\": 1024,\n",
        "      \"similarity\": \"cosine\"\n",
        "    }\n",
        "  ]\n",
        "}\n",
        "    \"\"\")\n",
        "    print(f\"Name the index: {ATLAS_VECTOR_SEARCH_INDEX_NAME}\")\n",
        "    print(f\"5. Apply the index to the '{VECTOR_COLLECTION}' collection\")\n",
        "\n",
        "\n",
        "def perform_vector_search(query_text, k=5):\n",
        "    \"\"\"Perform a vector search query using VoyageAI embeddings\"\"\"\n",
        "    print(f\"Performing vector search for: {query_text}\")\n",
        "\n",
        "    # Generate embedding for the query\n",
        "    voyage_embeddings = VoyageAIEmbeddings(api_key=VOYAGE_API_KEY)\n",
        "    query_embedding = voyage_embeddings.client.embed(\n",
        "        [query_text], model=voyage_embeddings.model, input_type=\"query\"\n",
        "    ).embeddings[0]\n",
        "\n",
        "    # Perform vector search\n",
        "    vector_search_results = vector_collection.aggregate(\n",
        "        [\n",
        "            {\n",
        "                \"$vectorSearch\": {\n",
        "                    \"index\": ATLAS_VECTOR_SEARCH_INDEX_NAME,\n",
        "                    \"path\": \"embedding\",\n",
        "                    \"queryVector\": query_embedding,\n",
        "                    \"numCandidates\": 100,\n",
        "                    \"limit\": k,\n",
        "                }\n",
        "            },\n",
        "            {\n",
        "                \"$project\": {\n",
        "                    \"object_id\": 1,\n",
        "                    \"object_type\": 1,\n",
        "                    \"name\": 1,\n",
        "                    \"title\": 1,\n",
        "                    \"competition\": 1,\n",
        "                    \"date\": 1,\n",
        "                    \"data\": 1,\n",
        "                    \"score\": {\"$meta\": \"vectorSearchScore\"},\n",
        "                }\n",
        "            },\n",
        "        ]\n",
        "    )\n",
        "\n",
        "    results = list(vector_search_results)\n",
        "\n",
        "    print(f\"Found {len(results)} relevant items:\")\n",
        "    for i, result in enumerate(results):\n",
        "        if result[\"object_type\"] == \"team\":\n",
        "            print(\n",
        "                f\"{i+1}. Team: {result.get('name', 'Unknown')} (Score: {result.get('score', 0):.4f})\"\n",
        "            )\n",
        "        elif result[\"object_type\"] == \"match\":\n",
        "            home = result.get(\"data\", {}).get(\"home_team\", \"Unknown\")\n",
        "            away = result.get(\"data\", {}).get(\"away_team\", \"Unknown\")\n",
        "            score = f\"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}\"\n",
        "            print(\n",
        "                f\"{i+1}. Match: {home} vs {away} ({score}) (Score: {result.get('score', 0):.4f})\"\n",
        "            )\n",
        "        elif result[\"object_type\"] == \"news\":\n",
        "            print(\n",
        "                f\"{i+1}. News: {result.get('title', 'Unknown')} (Score: {result.get('score', 0):.4f})\"\n",
        "            )\n",
        "\n",
        "    return results"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LWaG8AgOBfma",
        "outputId": "699e4fd7-b2e6-47af-9acc-7463c781a9b1"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Generating embeddings for sports data...\n",
            "Processed 15/15 embeddings\n",
            "Processed 7/7 embeddings\n",
            "Processed 5/5 embeddings\n",
            "Saved 27 embedding records to MongoDB\n",
            "Setting up Vector Search Index in MongoDB Atlas...\n",
            "Note: To create the vector search index in MongoDB Atlas:\n",
            "1. Go to the MongoDB Atlas dashboard\n",
            "2. Select your cluster\n",
            "3. Go to the 'Search' tab\n",
            "4. Create a new index on 'vector_features'with the following configuration:\n",
            "\n",
            "   {\n",
            "  \"fields\": [\n",
            "    {\n",
            "      \"type\": \"vector\",\n",
            "      \"path\": \"embedding\",\n",
            "      \"numDimensions\": 1024,\n",
            "      \"similarity\": \"cosine\"\n",
            "    }\n",
            "  ]\n",
            "}\n",
            "    \n",
            "Name the index: voyage_vector_index\n",
            "5. Apply the index to the vector_features collection\n"
          ]
        }
      ],
      "source": [
        "# Create embeddings and save them to MongoDB\n",
        "vector_records = create_and_save_embeddings()\n",
        "\n",
        "# Print instructions for creating the vector search index\n",
        "# (in this notebook the index itself is created manually in the Atlas UI)\n",
        "create_vector_search_index()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "M8g7iIX3C8Dk",
        "outputId": "8c1240ea-dea7-46fe-c8a3-c390b644b0b2"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Testing vector search with example queries:\n",
            "\n",
            "==================================================\n",
            "QUERY: Recent Manchester United games\n",
            "==================================================\n",
            "Performing vector search for: Recent Manchester United games\n",
            "Found 10 relevant items:\n",
            "1. Team: Manchester United (Score: 0.7876)\n",
            "2. Match: MNU vs LIV (2-1) (Score: 0.7315)\n",
            "3. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.7312)\n",
            "4. Team: Manchester City (Score: 0.7214)\n",
            "5. Team: Chelsea (Score: 0.6717)\n",
            "6. News: Manchester City's Erling Haaland on track to break Premier League scoring record (Score: 0.6715)\n",
            "7. Match: ARS vs MNC (1-1) (Score: 0.6690)\n",
            "8. Team: Tottenham Hotspur (Score: 0.6638)\n",
            "9. Team: Atletico Madrid (Score: 0.6635)\n",
            "10. Team: Arsenal (Score: 0.6631)\n",
            "\n",
            "==================================================\n",
            "QUERY: The Red Devils, how did they do?\n",
            "==================================================\n",
            "Performing vector search for: The Red Devils, how did they do?\n",
            "Found 10 relevant items:\n",
            "1. Team: Manchester United (Score: 0.6628)\n",
            "2. Team: Borussia Dortmund (Score: 0.6567)\n",
            "3. Team: Juventus (Score: 0.6364)\n",
            "4. Match: JUV vs INT (1-1) (Score: 0.6277)\n",
            "5. Team: Bayern Munich (Score: 0.6154)\n",
            "6. Team: Liverpool (Score: 0.6116)\n",
            "7. Team: Paris Saint-Germain (Score: 0.6052)\n",
            "8. Team: Manchester City (Score: 0.6021)\n",
            "9. Match: ARS vs MNC (1-1) (Score: 0.6014)\n",
            "10. Team: AC Milan (Score: 0.6007)\n",
            "\n",
            "==================================================\n",
            "QUERY: Who won El Clasico?\n",
            "==================================================\n",
            "Performing vector search for: Who won El Clasico?\n",
            "Found 10 relevant items:\n",
            "1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.7120)\n",
            "2. Match: BAR vs RMA (3-2) (Score: 0.7113)\n",
            "3. Team: Real Madrid (Score: 0.6963)\n",
            "4. Team: Atletico Madrid (Score: 0.6953)\n",
            "5. Match: ATM vs BAR (1-2) (Score: 0.6768)\n",
            "6. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.6362)\n",
            "7. Team: Barcelona (Score: 0.6337)\n",
            "8. Team: AC Milan (Score: 0.6280)\n",
            "9. Team: Inter Milan (Score: 0.6269)\n",
            "10. Match: BAY vs BVB (4-0) (Score: 0.6234)\n",
            "\n",
            "==================================================\n",
            "QUERY: Premier League match results\n",
            "==================================================\n",
            "Performing vector search for: Premier League match results\n",
            "Found 10 relevant items:\n",
            "1. Team: Tottenham Hotspur (Score: 0.7127)\n",
            "2. Team: Chelsea (Score: 0.6972)\n",
            "3. Team: Manchester City (Score: 0.6942)\n",
            "4. Match: ARS vs MNC (1-1) (Score: 0.6912)\n",
            "5. Team: Liverpool (Score: 0.6910)\n",
            "6. Team: Arsenal (Score: 0.6883)\n",
            "7. Team: Manchester United (Score: 0.6875)\n",
            "8. Match: MNU vs LIV (2-1) (Score: 0.6852)\n",
            "9. Match: CHE vs TOT (3-0) (Score: 0.6846)\n",
            "10. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.6694)\n",
            "\n",
            "==================================================\n",
            "QUERY: Player injuries news\n",
            "==================================================\n",
            "Performing vector search for: Player injuries news\n",
            "Found 10 relevant items:\n",
            "1. News: Liverpool suffer injury blow as Salah ruled out for three weeks (Score: 0.7018)\n",
            "2. Team: Inter Milan (Score: 0.6357)\n",
            "3. Team: Manchester United (Score: 0.6354)\n",
            "4. Team: Tottenham Hotspur (Score: 0.6344)\n",
            "5. Team: Chelsea (Score: 0.6288)\n",
            "6. Team: Juventus (Score: 0.6286)\n",
            "7. Team: Paris Saint-Germain (Score: 0.6244)\n",
            "8. Team: Real Madrid (Score: 0.6239)\n",
            "9. Team: Atletico Madrid (Score: 0.6221)\n",
            "10. Team: Manchester City (Score: 0.6215)\n",
            "\n",
            "==================================================\n",
            "QUERY: Bayern Munich performance\n",
            "==================================================\n",
            "Performing vector search for: Bayern Munich performance\n",
            "Found 10 relevant items:\n",
            "1. Team: Bayern Munich (Score: 0.8020)\n",
            "2. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.7724)\n",
            "3. Match: BAY vs BVB (4-0) (Score: 0.7520)\n",
            "4. Team: Borussia Dortmund (Score: 0.6945)\n",
            "5. Team: Barcelona (Score: 0.6800)\n",
            "6. Team: Real Madrid (Score: 0.6786)\n",
            "7. Team: Paris Saint-Germain (Score: 0.6771)\n",
            "8. Match: ATM vs BAR (1-2) (Score: 0.6743)\n",
            "9. Team: Inter Milan (Score: 0.6734)\n",
            "10. Team: Atletico Madrid (Score: 0.6693)\n"
          ]
        }
      ],
      "source": [
        "# Example search queries to test our vector search\n",
        "example_queries = [\n",
        "    \"Recent Manchester United games\",\n",
        "    \"The Red Devils, how did they do?\",\n",
        "    \"Who won El Clasico?\",\n",
        "    \"Premier League match results\",\n",
        "    \"Player injuries news\",\n",
        "    \"Bayern Munich performance\",\n",
        "]\n",
        "\n",
        "print(\"Testing vector search with example queries:\")\n",
        "for query in example_queries:\n",
        "    print(\"\\n\" + \"=\" * 50)\n",
        "    print(f\"QUERY: {query}\")\n",
        "    print(\"=\" * 50)\n",
        "    results = perform_vector_search(query, k=10)"
      ]
    },
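    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The scores printed by the vector search above come from an index configured with `\"similarity\": \"cosine\"`. According to the Atlas Vector Search documentation, the reported score for cosine similarity is normalized to the range 0 to 1 as `(1 + cosine) / 2`. The sketch below reproduces that arithmetic in plain Python on toy vectors. The helper names and the 3-dimensional vectors are illustrative assumptions, not part of this notebook's pipeline.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "\n",
        "def cosine_similarity(a, b):\n",
        "    \"\"\"Cosine of the angle between two equal-length, non-zero vectors.\"\"\"\n",
        "    dot = sum(x * y for x, y in zip(a, b))\n",
        "    norm_a = math.sqrt(sum(x * x for x in a))\n",
        "    norm_b = math.sqrt(sum(y * y for y in b))\n",
        "    return dot / (norm_a * norm_b)\n",
        "\n",
        "\n",
        "def atlas_cosine_score(a, b):\n",
        "    \"\"\"Map cosine similarity from [-1, 1] to [0, 1] (hypothetical helper).\"\"\"\n",
        "    return (1 + cosine_similarity(a, b)) / 2\n",
        "\n",
        "\n",
        "# Toy 3-dimensional vectors; the embeddings in this notebook have 1024 dimensions\n",
        "query = [1.0, 0.0, 0.0]\n",
        "print(atlas_cosine_score(query, [1.0, 0.0, 0.0]))  # same direction\n",
        "print(atlas_cosine_score(query, [0.0, 1.0, 0.0]))  # orthogonal\n",
        "print(atlas_cosine_score(query, [-1.0, 0.0, 0.0]))  # opposite direction\n",
        "```\n",
        "\n",
        "Under this normalization a score of 0.5 corresponds to orthogonal vectors, so the scores of roughly 0.60 to 0.80 in the results above all indicate positive similarity to the query."
      ]
    },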
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "znE3zwX5Sjci"
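      },
      "source": [
        "## Hybrid Search\n",
        "\n",
        "[Hybrid Search](https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/reciprocal-rank-fusion/) combines full-text search, which matches query tokens against document text, with vector search, which matches on semantic similarity. The two ranked result lists are merged with reciprocal rank fusion, and a weight on each list controls how much it contributes to the final score."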
      ]
    },
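    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the fusion step concrete, here is a minimal sketch of weighted reciprocal rank fusion in plain Python, with no MongoDB connection required. It assumes 0-based ranks and a smoothing constant of 60, matching the hybrid search pipeline in this notebook. The function name, the document ids, and the two result lists are hypothetical and exist only for illustration.\n",
        "\n",
        "```python\n",
        "def reciprocal_rank_fusion(ranked_lists, weights, k=60):\n",
        "    \"\"\"Fuse ranked lists of document ids into a single ranking.\n",
        "\n",
        "    A document scores weight / (k + rank) in each list it appears in\n",
        "    (rank is the 0-based position); scores are summed across lists.\n",
        "    \"\"\"\n",
        "    scores = {}\n",
        "    for name, doc_ids in ranked_lists.items():\n",
        "        for rank, doc_id in enumerate(doc_ids):\n",
        "            scores[doc_id] = scores.get(doc_id, 0.0) + weights[name] / (k + rank)\n",
        "    # Highest fused score first\n",
        "    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)\n",
        "\n",
        "\n",
        "# Hypothetical result lists; these ids are made up for illustration\n",
        "ranked_lists = {\n",
        "    \"vector\": [\"team_MNU\", \"match_MNU_LIV\", \"news_001\"],\n",
        "    \"full_text\": [\"match_MNU_LIV\", \"news_001\", \"team_MNC\"],\n",
        "}\n",
        "weights = {\"vector\": 0.5, \"full_text\": 0.5}\n",
        "\n",
        "for doc_id, score in reciprocal_rank_fusion(ranked_lists, weights):\n",
        "    print(f\"{doc_id}: {score:.4f}\")\n",
        "```\n",
        "\n",
        "A document that appears in both lists, such as `match_MNU_LIV` here, accumulates a contribution from each list and ranks above documents found by only one search. Raising one weight relative to the other shifts the final order toward that search method."
      ]
    },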
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "k4UbHWU-Smcc"
      },
      "outputs": [],
      "source": [
        "# Create the full-text search (FTS) index\n",
        "\n",
        "\n",
        "def create_full_search_index():\n",
        "    \"\"\"Print instructions for creating a full-text search index in MongoDB Atlas\"\"\"\n",
        "\n",
        "    print(\"Setting up Search Index in MongoDB Atlas...\")\n",
        "    print(\"Note: To create the full-text search index in MongoDB Atlas:\")\n",
        "    print(\"1. Go to the MongoDB Atlas dashboard\")\n",
        "    print(\"2. Select your cluster\")\n",
        "    print(\"3. Go to the 'Search' tab\")\n",
        "    print(\n",
        "        f\"4. Create a new 'Search' index on '{VECTOR_COLLECTION}' with the following configuration:\"\n",
        "    )\n",
        "    print(\"\"\"\n",
        "{\n",
        "  \"mappings\": {\n",
        "    \"dynamic\": true\n",
        "  }\n",
        "}\n",
        "    \"\"\")\n",
        "    print(\"Name the index: default\")\n",
        "    print(f\"5. Apply the index to the '{VECTOR_COLLECTION}' collection\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "40iyYjCmWEWg"
      },
      "outputs": [],
      "source": [
        "def hybrid_search(query, limit=5, vector_weight=0.5, full_text_weight=0.5):\n",
        "    \"\"\"Perform a hybrid search using vector search and full-text search.\"\"\"\n",
        "\n",
        "    voyage_embeddings = VoyageAIEmbeddings(api_key=VOYAGE_API_KEY)\n",
        "    query_embedding = voyage_embeddings.client.embed(\n",
        "        [query], model=voyage_embeddings.model, input_type=\"query\"\n",
        "    ).embeddings[0]\n",
        "\n",
        "    pipeline = [\n",
        "        {\n",
        "            \"$vectorSearch\": {\n",
        "                \"index\": ATLAS_VECTOR_SEARCH_INDEX_NAME,\n",
        "                \"path\": \"embedding\",\n",
        "                \"queryVector\": query_embedding,\n",
        "                \"numCandidates\": 100,\n",
        "                \"limit\": limit * 2,  # Get more results for potential ranking\n",
        "            }\n",
        "        },\n",
        "        {\"$group\": {\"_id\": None, \"docs\": {\"$push\": \"$$ROOT\"}}},\n",
        "        {\"$unwind\": {\"path\": \"$docs\", \"includeArrayIndex\": \"rank\"}},\n",
        "        {\n",
        "            \"$addFields\": {\n",
        "                \"vs_score\": {\n",
        "                    \"$multiply\": [\n",
        "                        vector_weight,\n",
        "                        {\n",
        "                            \"$divide\": [\n",
        "                                1.0,\n",
        "                                {\n",
        "                                    \"$add\": [\"$rank\", 60]  # Adjust ranking\n",
        "                                },\n",
        "                            ]\n",
        "                        },\n",
        "                    ]\n",
        "                }\n",
        "            }\n",
        "        },\n",
        "        {\n",
        "            \"$project\": {\n",
        "                \"vs_score\": 1,\n",
        "                \"_id\": \"$docs._id\",\n",
        "                \"title\": \"$docs.title\",\n",
        "                \"object_type\": \"$docs.object_type\",\n",
        "                \"data\": \"$docs.data\",\n",
        "            }\n",
        "        },\n",
        "        {\n",
        "            \"$unionWith\": {\n",
        "                \"coll\": VECTOR_COLLECTION,\n",
        "                \"pipeline\": [\n",
        "                    {\n",
        "                        \"$search\": {\n",
        "                            \"index\": \"default\",\n",
        "                            \"compound\": {\n",
        "                                \"must\": [\n",
        "                                    {\n",
        "                                        \"text\": {\n",
        "                                            \"query\": query,\n",
        "                                            \"path\": {\"wildcard\": \"*\"},\n",
        "                                            \"fuzzy\": {},\n",
        "                                        }\n",
        "                                    }\n",
        "                                ]\n",
        "                            },\n",
        "                        }\n",
        "                    },\n",
        "                    {\"$limit\": limit * 2},\n",
        "                    {\"$group\": {\"_id\": None, \"docs\": {\"$push\": \"$$ROOT\"}}},\n",
        "                    {\"$unwind\": {\"path\": \"$docs\", \"includeArrayIndex\": \"fts_rank\"}},\n",
        "                    {\n",
        "                        \"$addFields\": {\n",
        "                            \"fts_score\": {\n",
        "                                \"$multiply\": [\n",
        "                                    full_text_weight,\n",
        "                                    {\"$divide\": [1.0, {\"$add\": [\"$fts_rank\", 60]}]},\n",
        "                                ]\n",
        "                            }\n",
        "                        }\n",
        "                    },\n",
        "                    {\n",
        "                        \"$project\": {\n",
        "                            \"fts_score\": 1,\n",
        "                            \"_id\": \"$docs._id\",\n",
        "                            \"title\": \"$docs.title\",\n",
        "                            \"object_type\": \"$docs.object_type\",\n",
        "                            \"data\": \"$docs.data\",\n",
        "                        }\n",
        "                    },\n",
        "                ],\n",
        "            }\n",
        "        },\n",
        "        {\n",
        "            \"$addFields\": {\n",
        "                \"final_score\": {\n",
        "                    \"$add\": [\n",
        "                        {\"$ifNull\": [\"$vs_score\", 0]},  # Handle missing vs_score\n",
        "                        {\"$ifNull\": [\"$fts_score\", 0]},  # Handle missing fts_score\n",
        "                    ]\n",
        "                }\n",
        "            }\n",
        "        },\n",
        "        {\"$sort\": {\"final_score\": -1}},\n",
        "        {\"$limit\": limit},\n",
        "    ]\n",
        "\n",
        "    results = list(vector_collection.aggregate(pipeline))\n",
        "\n",
        "    print(f\"Found {len(results)} relevant items:\")\n",
        "    for i, result in enumerate(results):\n",
        "        if result[\"object_type\"] == \"team\":\n",
        "            print(\n",
        "                f\"{i+1}. Team: {result.get('data', {}).get('name', 'Unknown')} (Score: {result.get('final_score', 0):.4f})\"\n",
        "            )\n",
        "        elif result[\"object_type\"] == \"match\":\n",
        "            home = result.get(\"data\", {}).get(\"home_team\", \"Unknown\")\n",
        "            away = result.get(\"data\", {}).get(\"away_team\", \"Unknown\")\n",
        "            score = f\"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}\"\n",
        "            print(\n",
        "                f\"{i+1}. Match: {home} vs {away} ({score}) (Score: {result.get('final_score', 0):.4f})\"\n",
        "            )\n",
        "        elif result[\"object_type\"] == \"news\":\n",
        "            print(\n",
        "                f\"{i+1}. News: {result.get('data', {}).get('title', 'Unknown')} (Score: {result.get('final_score', 0):.4f})\"\n",
        "            )\n",
        "\n",
        "    return results"
      ]
    },
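    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `1 / (rank + 60)` terms in the pipeline above follow the weighted reciprocal rank fusion (RRF) pattern. As a minimal, database-free sketch of that technique, the cell below fuses two ranked lists of document IDs in plain Python. The function name `weighted_rrf`, the sample IDs, and the 0.5 default weights are assumptions made for illustration (the defaults are inferred from the printed scores, where 0.5 / 60 ≈ 0.0083). In textbook RRF, a document that appears in both lists accumulates both contributions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from collections import defaultdict\n",
        "\n",
        "RRF_K = 60  # same smoothing constant the pipeline adds to each rank\n",
        "\n",
        "\n",
        "def weighted_rrf(vector_ids, text_ids, vector_weight=0.5, full_text_weight=0.5):\n",
        "    \"\"\"Fuse two ranked ID lists with weighted reciprocal rank fusion (illustrative).\"\"\"\n",
        "    scores = defaultdict(float)\n",
        "    ranked_lists = ((vector_weight, vector_ids), (full_text_weight, text_ids))\n",
        "    for weight, ranked_ids in ranked_lists:\n",
        "        # enumerate() yields 0-based ranks, like includeArrayIndex in $unwind\n",
        "        for rank, doc_id in enumerate(ranked_ids):\n",
        "            scores[doc_id] += weight / (rank + RRF_K)\n",
        "    # Highest fused score first\n",
        "    return sorted(scores.items(), key=lambda item: item[1], reverse=True)\n",
        "\n",
        "\n",
        "# Hypothetical document IDs; \"match_mnu_liv\" is ranked by both searches\n",
        "fused = weighted_rrf([\"team_mnu\", \"match_mnu_liv\"], [\"match_mnu_liv\", \"news_bruno\"])\n",
        "for doc_id, score in fused:\n",
        "    print(f\"{doc_id}: {score:.4f}\")"
      ]
    },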
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "KSRA4b64WdIG",
        "outputId": "867e77ea-9337-4cf9-adfb-9451f9bfafab"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Testing vector search with default wieghts example queries:\n",
            "\n",
            "==================================================\n",
            "QUERY: Recent Manchester United games\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. Team: Manchester United (Score: 0.0083)\n",
            "2. Team: Manchester United (Score: 0.0083)\n",
            "3. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0082)\n",
            "4. Match: MNU vs LIV (2-1) (Score: 0.0082)\n",
            "5. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0081)\n",
            "Testing vector search with favor of vector wieghts example queries:\n",
            "\n",
            "==================================================\n",
            "QUERY: The Red Devils, how did they do?\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. Team: Chelsea (Score: 0.0083)\n",
            "2. Team: Manchester United (Score: 0.0083)\n",
            "3. Team: Liverpool (Score: 0.0082)\n",
            "4. Team: Borussia Dortmund (Score: 0.0082)\n",
            "5. Team: Juventus (Score: 0.0081)\n",
            "Testing vector search with favor of vector wieghts example queries:\n",
            "\n",
            "==================================================\n",
            "QUERY: Who won El Clasico?\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0083)\n",
            "2. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0083)\n",
            "3. Match: CHE vs TOT (3-0) (Score: 0.0082)\n",
            "4. Match: BAR vs RMA (3-2) (Score: 0.0082)\n",
            "5. Team: Real Madrid (Score: 0.0081)\n",
            "Testing vector search with favor of vector wieghts example queries:\n",
            "\n",
            "==================================================\n",
            "QUERY: Premier League match results\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. News: Manchester City's Erling Haaland on track to break Premier League scoring record (Score: 0.0083)\n",
            "2. Team: Tottenham Hotspur (Score: 0.0083)\n",
            "3. Match: CHE vs TOT (3-0) (Score: 0.0082)\n",
            "4. Team: Chelsea (Score: 0.0082)\n",
            "5. Team: Manchester City (Score: 0.0081)\n",
            "Testing vector search with favor of vector wieghts example queries:\n",
            "\n",
            "==================================================\n",
            "QUERY: Player injuries news\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0083)\n",
            "2. News: Liverpool suffer injury blow as Salah ruled out for three weeks (Score: 0.0083)\n",
            "3. News: Liverpool suffer injury blow as Salah ruled out for three weeks (Score: 0.0082)\n",
            "4. Team: Inter Milan (Score: 0.0082)\n",
            "5. Team: Manchester United (Score: 0.0081)\n",
            "Testing vector search with favor of vector wieghts example queries:\n",
            "\n",
            "==================================================\n",
            "QUERY: Bayern Munich performance\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.0083)\n",
            "2. Team: Bayern Munich (Score: 0.0083)\n",
            "3. Team: Bayern Munich (Score: 0.0082)\n",
            "4. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.0082)\n",
            "5. Match: BAY vs BVB (4-0) (Score: 0.0081)\n",
            "Testing vector search with favor of vector wieghts example queries:\n",
            "\n",
            "==================================================\n",
            "QUERY: Recent Manchester United games\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. Team: Manchester United (Score: 0.0150)\n",
            "2. Match: MNU vs LIV (2-1) (Score: 0.0148)\n",
            "3. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0145)\n",
            "4. Team: Manchester City (Score: 0.0143)\n",
            "5. Team: Chelsea (Score: 0.0141)\n",
            "\n",
            "==================================================\n",
            "QUERY: The Red Devils, how did they do?\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. Team: Manchester United (Score: 0.0150)\n",
            "2. Team: Borussia Dortmund (Score: 0.0148)\n",
            "3. Team: Juventus (Score: 0.0145)\n",
            "4. Match: JUV vs INT (1-1) (Score: 0.0143)\n",
            "5. Team: Bayern Munich (Score: 0.0141)\n",
            "\n",
            "==================================================\n",
            "QUERY: Who won El Clasico?\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0150)\n",
            "2. Match: BAR vs RMA (3-2) (Score: 0.0148)\n",
            "3. Team: Real Madrid (Score: 0.0145)\n",
            "4. Team: Atletico Madrid (Score: 0.0143)\n",
            "5. Match: ATM vs BAR (1-2) (Score: 0.0141)\n",
            "\n",
            "==================================================\n",
            "QUERY: Premier League match results\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. Team: Tottenham Hotspur (Score: 0.0150)\n",
            "2. Team: Chelsea (Score: 0.0148)\n",
            "3. Team: Manchester City (Score: 0.0145)\n",
            "4. Match: ARS vs MNC (1-1) (Score: 0.0143)\n",
            "5. Team: Liverpool (Score: 0.0141)\n",
            "\n",
            "==================================================\n",
            "QUERY: Player injuries news\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. News: Liverpool suffer injury blow as Salah ruled out for three weeks (Score: 0.0150)\n",
            "2. Team: Inter Milan (Score: 0.0148)\n",
            "3. Team: Manchester United (Score: 0.0145)\n",
            "4. Team: Tottenham Hotspur (Score: 0.0143)\n",
            "5. Team: Chelsea (Score: 0.0141)\n",
            "\n",
            "==================================================\n",
            "QUERY: Bayern Munich performance\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. Team: Bayern Munich (Score: 0.0150)\n",
            "2. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.0148)\n",
            "3. Match: BAY vs BVB (4-0) (Score: 0.0145)\n",
            "4. Team: Borussia Dortmund (Score: 0.0143)\n",
            "5. Team: Barcelona (Score: 0.0141)\n"
          ]
        }
      ],
      "source": [
        "# Example search queries to test our hybrid search\n",
        "example_queries = [\n",
        "    \"Recent Manchester United games\",\n",
        "    \"The Red Devils, how did they do?\",\n",
        "    \"Who won El Clasico?\",\n",
        "    \"Premier League match results\",\n",
        "    \"Player injuries news\",\n",
        "    \"Bayern Munich performance\",\n",
        "]\n",
        "\n",
        "print(\"Testing hybrid search with default weights on example queries:\")\n",
        "for query in example_queries:\n",
        "    print(\"\\n\" + \"=\" * 50)\n",
        "    print(f\"QUERY: {query}\")\n",
        "    print(\"=\" * 50)\n",
        "    results = hybrid_search(query, limit=5)\n",
        "\n",
        "# Print this header once, before the second loop, rather than after every query\n",
        "print(\"\\nTesting hybrid search with weights favoring vector search on example queries:\")\n",
        "for query in example_queries:\n",
        "    print(\"\\n\" + \"=\" * 50)\n",
        "    print(f\"QUERY: {query}\")\n",
        "    print(\"=\" * 50)\n",
        "    results = hybrid_search(query, limit=5, vector_weight=0.9, full_text_weight=0.1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-KaqifBSzgUN"
      },
      "source": [
        "## RAG with OpenAI\n",
        "\n",
        "Retrieval-Augmented Generation (RAG) retrieves documents relevant to a question, here through vector similarity or hybrid search, and passes them to an LLM as context so that the answer is grounded in the stored data."
      ]
    },
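    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a minimal sketch of the retrieve-then-generate flow, the cell below assembles the chat messages from documents that have already been retrieved, without calling any API. The helper name `build_rag_messages`, the `text` field, and the sample documents are hypothetical and exist only for illustration; the system and user message layout mirrors the one used by this notebook's RAG functions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def build_rag_messages(query, documents):\n",
        "    \"\"\"Turn retrieved documents into chat messages for an LLM (illustrative helper).\"\"\"\n",
        "    # One line of context per retrieved document\n",
        "    context = \"\\n\".join(\n",
        "        f\"{doc['object_type'].title()}: {doc['text']}\" for doc in documents\n",
        "    )\n",
        "    return [\n",
        "        {\n",
        "            \"role\": \"system\",\n",
        "            \"content\": \"You are a helpful sports assistant. Answer the user's query using the provided context.\",\n",
        "        },\n",
        "        {\"role\": \"user\", \"content\": f\"{query}\\n\\nContext:\\n{context}\"},\n",
        "    ]\n",
        "\n",
        "\n",
        "# Hypothetical retrieved documents; the \"text\" field is an assumption for this sketch\n",
        "sample_docs = [\n",
        "    {\"object_type\": \"match\", \"text\": \"BAR vs RMA (3-2)\"},\n",
        "    {\"object_type\": \"news\", \"text\": \"Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer\"},\n",
        "]\n",
        "messages = build_rag_messages(\"Who won El Clasico?\", sample_docs)\n",
        "print(messages[1][\"content\"])"
      ]
    },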
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jUeAx4QIYfsd"
      },
      "outputs": [],
      "source": [
        "from openai import OpenAI\n",
        "\n",
        "client = OpenAI(api_key=OPENAI_API_KEY)\n",
        "\n",
        "\n",
        "def generate_response_with_hybrid_search(query, limit=5):\n",
        "    \"\"\"Generates a response using OpenAI's responses API with hybrid search.\"\"\"\n",
        "\n",
        "    # 1. Perform hybrid search to retrieve relevant documents\n",
        "    search_results = hybrid_search(query, limit=limit)\n",
        "\n",
        "    # 2. Format search results for OpenAI API\n",
        "    context = \"\"\n",
        "    for result in search_results:\n",
        "        if result[\"object_type\"] == \"team\":\n",
        "            context += f\"Team: {result.get('data', {}).get('name', 'Unknown')}\\n\"\n",
        "        elif result[\"object_type\"] == \"match\":\n",
        "            home = result.get(\"data\", {}).get(\"home_team\", \"Unknown\")\n",
        "            away = result.get(\"data\", {}).get(\"away_team\", \"Unknown\")\n",
        "            score = f\"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}\"\n",
        "            context += f\"Match: {home} vs {away} ({score})\\n\"\n",
        "        elif result[\"object_type\"] == \"news\":\n",
        "            context += f\"News: {result.get('data', {}).get('title', 'Unknown')}\\n{result.get('data', {}).get('content', '')}\\n\"\n",
        "\n",
        "    # 3. Call OpenAI API to generate response\n",
        "    response = client.chat.completions.create(\n",
        "        model=\"gpt-4o\",\n",
        "        messages=[\n",
        "            {\n",
        "                \"role\": \"system\",\n",
        "                \"content\": \"You are a helpful sports assistant. Answer the user's query using the provided context.\",\n",
        "            },\n",
        "            {\"role\": \"user\", \"content\": f\"{query}\\n\\nContext:\\n{context}\"},\n",
        "        ],\n",
        "    )\n",
        "\n",
        "    return response.choices[0].message.content\n",
        "\n",
        "\n",
        "def generate_response_with_vector_search(query, limit=5):\n",
        "    \"\"\"Generates a response using OpenAI's responses API with vector search.\"\"\"\n",
        "\n",
        "    # 1. Perform vector search to retrieve relevant documents\n",
        "    search_results = perform_vector_search(query, k=limit)\n",
        "\n",
        "    # 2. Format search results for OpenAI API\n",
        "    context = \"\"\n",
        "    for result in search_results:\n",
        "        if result[\"object_type\"] == \"team\":\n",
        "            context += f\"Team: {result.get('name', 'Unknown')}\\n\"\n",
        "        elif result[\"object_type\"] == \"match\":\n",
        "            home = result.get(\"data\", {}).get(\"home_team\", \"Unknown\")\n",
        "            away = result.get(\"data\", {}).get(\"away_team\", \"Unknown\")\n",
        "            score = f\"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}\"\n",
        "            context += f\"Match: {home} vs {away} ({score})\\n\"\n",
        "        elif result[\"object_type\"] == \"news\":\n",
        "            context += f\"News: {result.get('title', 'Unknown')}\\n{result.get('data', {}).get('content', '')}\\n\"\n",
        "\n",
        "    # 3. Call OpenAI API to generate response\n",
        "    response = client.chat.completions.create(\n",
        "        model=\"gpt-4o\",\n",
        "        messages=[\n",
        "            {\n",
        "                \"role\": \"system\",\n",
        "                \"content\": \"You are a helpful sports assistant. Answer the user's query using the provided context.\",\n",
        "            },\n",
        "            {\"role\": \"user\", \"content\": f\"{query}\\n\\nContext:\\n{context}\"},\n",
        "        ],\n",
        "    )\n",
        "\n",
        "    return response.choices[0].message.content"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "nVEmdISgZ0Tg",
        "outputId": "b95aefe7-a1dd-4024-c9ad-8c4aee83a674"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Testing hybrid search with example queries:\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0083)\n",
            "2. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0083)\n",
            "3. Match: CHE vs TOT (3-0) (Score: 0.0082)\n",
            "4. Match: BAR vs RMA (3-2) (Score: 0.0082)\n",
            "5. Team: Real Madrid (Score: 0.0081)\n",
            "====================Hybrid RAG====================\n",
            "Response (Hybrid Search): Barcelona won El Clásico, defeating Real Madrid with a score of 3-2 at Camp Nou.\n",
            "\n",
            "Testing vector search with example queries:\n",
            "==================================================\n",
            "Performing vector search for: Who won El Clasico?\n",
            "Found 5 relevant items:\n",
            "1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.7120)\n",
            "2. Match: BAR vs RMA (3-2) (Score: 0.7113)\n",
            "3. Team: Real Madrid (Score: 0.6963)\n",
            "4. Team: Atletico Madrid (Score: 0.6953)\n",
            "5. Match: ATM vs BAR (1-2) (Score: 0.6768)\n",
            "====================Vector RAG====================\n",
            "Response (Vector Search): Barcelona won El Clásico against Real Madrid with a 3-2 victory at Camp Nou.\n"
          ]
        }
      ],
      "source": [
        "query = \"Who won El Clasico?\"\n",
        "\n",
        "# Using hybrid search\n",
        "print(\"Testing hybrid search with example queries:\")\n",
        "print(\"=\" * 50)\n",
        "response_hybrid = generate_response_with_hybrid_search(query)\n",
        "\n",
        "print(\"=\" * 20 + \"Hybrid RAG\" + \"=\" * 20)\n",
        "print(\"Response (Hybrid Search):\", response_hybrid)\n",
        "\n",
        "# Using vector search\n",
        "print(\"\\nTesting vector search with example queries:\")\n",
        "print(\"=\" * 50)\n",
        "response_vector = generate_response_with_vector_search(query)\n",
        "print(\"=\" * 20 + \"Vector RAG\" + \"=\" * 20)\n",
        "print(\"Response (Vector Search):\", response_vector)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a_JfDe_0BU_9"
      },
      "source": [
        "## Agentic RAG with Hybrid Search\n",
        "\n",
        "Here we use the [openai-agents](https://openai.github.io/openai-agents-python/) SDK to expose the `hybrid_search` function as a tool. The agent can then choose the search terms and weights it passes to the tool, and it can chain several tool calls to complete multi-step tasks."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Mb7queRQ8ARO",
        "outputId": "e2a17285-8df9-401a-f527-0a3ea7833629"
      },
      "outputs": [
      ],
      "source": [
        "!pip install -Uq openai-agents"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "-wSPNO7o6-NK"
      },
      "outputs": [],
      "source": [
        "OPENAI_MODEL = \"gpt-4o\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "id": "h8aKCiMM9y5o"
      },
      "outputs": [],
      "source": [
        "from agents.tool import function_tool\n",
        "\n",
        "\n",
        "@function_tool\n",
        "def hybrid_search(\n",
        "    query: str, limit: int, vector_weight: float, full_text_weight: float\n",
        ") -> list:\n",
        "    \"\"\"Perform a hybrid search using vector search and full-text search.\"\"\"\n",
        "\n",
        "    voyage_embeddings = VoyageAIEmbeddings(api_key=VOYAGE_API_KEY)\n",
        "    query_embedding = voyage_embeddings.client.embed(\n",
        "        [query], model=voyage_embeddings.model, input_type=\"query\"\n",
        "    ).embeddings[0]\n",
        "\n",
        "    pipeline = [\n",
        "        {\n",
        "            \"$vectorSearch\": {\n",
        "                \"index\": ATLAS_VECTOR_SEARCH_INDEX_NAME,\n",
        "                \"path\": \"embedding\",\n",
        "                \"queryVector\": query_embedding,\n",
        "                \"numCandidates\": 100,\n",
        "                \"limit\": limit * 2,  # Get more results for potential ranking\n",
        "            }\n",
        "        },\n",
        "        {\"$group\": {\"_id\": None, \"docs\": {\"$push\": \"$$ROOT\"}}},\n",
        "        {\"$unwind\": {\"path\": \"$docs\", \"includeArrayIndex\": \"rank\"}},\n",
        "        {\n",
        "            \"$addFields\": {\n",
        "                \"vs_score\": {\n",
        "                    \"$multiply\": [\n",
        "                        vector_weight,\n",
        "                        {\n",
        "                            \"$divide\": [\n",
        "                                1.0,\n",
        "                                {\n",
        "                                    \"$add\": [\"$rank\", 60]  # Adjust ranking\n",
        "                                },\n",
        "                            ]\n",
        "                        },\n",
        "                    ]\n",
        "                }\n",
        "            }\n",
        "        },\n",
        "        {\n",
        "            \"$project\": {\n",
        "                \"vs_score\": 1,\n",
        "                \"_id\": \"$docs._id\",\n",
        "                \"title\": \"$docs.title\",\n",
        "                \"object_type\": \"$docs.object_type\",\n",
        "                \"data\": \"$docs.data\",\n",
        "            }\n",
        "        },\n",
        "        {\n",
        "            \"$unionWith\": {\n",
        "                \"coll\": VECTOR_COLLECTION,\n",
        "                \"pipeline\": [\n",
        "                    {\n",
        "                        \"$search\": {\n",
        "                            \"index\": \"default\",\n",
        "                            \"compound\": {\n",
        "                                \"must\": [\n",
        "                                    {\n",
        "                                        \"text\": {\n",
        "                                            \"query\": query,\n",
        "                                            \"path\": {\"wildcard\": \"*\"},\n",
        "                                            \"fuzzy\": {},\n",
        "                                        }\n",
        "                                    }\n",
        "                                ]\n",
        "                            },\n",
        "                        }\n",
        "                    },\n",
        "                    {\"$limit\": limit * 2},\n",
        "                    {\"$group\": {\"_id\": None, \"docs\": {\"$push\": \"$$ROOT\"}}},\n",
        "                    {\"$unwind\": {\"path\": \"$docs\", \"includeArrayIndex\": \"fts_rank\"}},\n",
        "                    {\n",
        "                        \"$addFields\": {\n",
        "                            \"fts_score\": {\n",
        "                                \"$multiply\": [\n",
        "                                    full_text_weight,\n",
        "                                    {\"$divide\": [1.0, {\"$add\": [\"$fts_rank\", 60]}]},\n",
        "                                ]\n",
        "                            }\n",
        "                        }\n",
        "                    },\n",
        "                    {\n",
        "                        \"$project\": {\n",
        "                            \"fts_score\": 1,\n",
        "                            \"_id\": \"$docs._id\",\n",
        "                            \"title\": \"$docs.title\",\n",
        "                            \"object_type\": \"$docs.object_type\",\n",
        "                            \"data\": \"$docs.data\",\n",
        "                        }\n",
        "                    },\n",
        "                ],\n",
        "            }\n",
        "        },\n",
        "        {\n",
        "            \"$addFields\": {\n",
        "                \"final_score\": {\n",
        "                    \"$add\": [\n",
        "                        {\"$ifNull\": [\"$vs_score\", 0]},  # Handle missing vs_score\n",
        "                        {\"$ifNull\": [\"$fts_score\", 0]},  # Handle missing fts_score\n",
        "                    ]\n",
        "                }\n",
        "            }\n",
        "        },\n",
        "        {\"$sort\": {\"final_score\": -1}},\n",
        "        {\"$limit\": limit},\n",
        "    ]\n",
        "\n",
        "    results = list(vector_collection.aggregate(pipeline))\n",
        "\n",
        "    print(f\"Found {len(results)} relevant items:\")\n",
        "    for i, result in enumerate(results):\n",
        "        if result[\"object_type\"] == \"team\":\n",
        "            print(\n",
        "                f\"{i+1}. Team: {result.get('data', {}).get('name', 'Unknown')} (Score: {result.get('final_score', 0):.4f})\"\n",
        "            )\n",
        "        elif result[\"object_type\"] == \"match\":\n",
        "            home = result.get(\"data\", {}).get(\"home_team\", \"Unknown\")\n",
        "            away = result.get(\"data\", {}).get(\"away_team\", \"Unknown\")\n",
        "            score = f\"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}\"\n",
        "            print(\n",
        "                f\"{i+1}. Match: {home} vs {away} ({score}) (Score: {result.get('final_score', 0):.4f})\"\n",
        "            )\n",
        "        elif result[\"object_type\"] == \"news\":\n",
        "            print(\n",
        "                f\"{i+1}. News: {result.get('data', {}).get('title', 'Unknown')} (Score: {result.get('final_score', 0):.4f})\"\n",
        "            )\n",
        "\n",
        "    return results"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "VAp9tIZjRkcT",
        "outputId": "3e43c305-b30d-405f-ca31-b1598a1ce9fd"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Testing agentic hybrid search with example queries:\n",
            "==================================================\n",
            "\n",
            "==================================================\n",
            "QUERY: Recent Manchester United games\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. Team: Manchester United (Score: 0.0117)\n",
            "2. Match: MNU vs LIV (2-1) (Score: 0.0115)\n",
            "3. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0113)\n",
            "4. Team: Manchester City (Score: 0.0111)\n",
            "5. Team: Chelsea (Score: 0.0109)\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:openai.agents:OPENAI_API_KEY is not set, skipping trace export\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Here are some of the recent Manchester United games:\n",
            "\n",
            "1. **Against Liverpool**  \n",
            "   Date: March 24, 2025  \n",
            "   Competition: Premier League  \n",
            "   Score: Manchester United 2 - 1 Liverpool  \n",
            "   **Summary:** Manchester United secured a thrilling 2-1 victory over Liverpool at Old Trafford. Bruno Fernandes opened the scoring with a penalty in the 34th minute, before Marcus Rashford doubled the lead with a brilliant solo effort. Mohamed Salah pulled one back for Liverpool, but United held on for a crucial win.\n",
            "\n",
            "Bruno Fernandes has also been in sizzling form, winning the Premier League Player of the Month award for March. He scored 4 goals and provided 3 assists in 5 matches. Go Bruno! 🎉\n",
            "\n",
            "Would you like to know more about any specific game or player? 😊\n",
            "==================================================\n",
            "\n",
            "==================================================\n",
            "QUERY: The Red Devils, how did they do?\n",
            "==================================================\n",
            "Found 5 relevant items:\n",
            "1. Team: Manchester United (Score: 0.0083)\n",
            "2. Team: Manchester United (Score: 0.0083)\n",
            "3. Match: BAR vs RMA (3-2) (Score: 0.0082)\n",
            "4. Team: Borussia Dortmund (Score: 0.0082)\n",
            "5. Team: Liverpool (Score: 0.0081)\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:openai.agents:OPENAI_API_KEY is not set, skipping trace export\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "I couldn't find the latest match results for the Red Devils (Manchester United). However, they are known as one of the top teams in the Premier League! Would you like more info or try a different search? ⚽\n",
            "==================================================\n",
            "\n",
            "==================================================\n",
            "QUERY: Who won El Clasico?\n",
            "==================================================\n",
            "Found 1 relevant items:\n",
            "1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0083)\n",
            "Barcelona won the latest El Clásico against Real Madrid with a score of 3-2! Lamine Yamal made history by becoming the youngest goalscorer at just 16 years and 107 days old. How amazing is that? 🎉⚽🎉\n",
            "==================================================\n",
            "\n",
            "==================================================\n",
            "QUERY: Premier League match results\n",
            "==================================================\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:openai.agents:OPENAI_API_KEY is not set, skipping trace export\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Found 5 relevant items:\n",
            "1. News: Manchester City's Erling Haaland on track to break Premier League scoring record (Score: 0.0083)\n",
            "2. Team: Tottenham Hotspur (Score: 0.0083)\n",
            "3. Match: CHE vs TOT (3-0) (Score: 0.0082)\n",
            "4. Team: Chelsea (Score: 0.0082)\n",
            "5. Team: Manchester City (Score: 0.0081)\n",
            "Here's an exciting recent Premier League match result for you:\n",
            "\n",
            "- **Chelsea vs Tottenham Hotspur**\n",
            "  - **Date**: March 25, 2025\n",
            "  - **Stadium**: Stamford Bridge\n",
            "  - **Result**: Chelsea 3-0 Tottenham Hotspur\n",
            "  - **Summary**: Chelsea dominated the London derby with a 3-0 victory at Stamford Bridge. Cole Palmer scored twice in the first half, and Nicolas Jackson added a third goal in the 78th minute. Spurs found it difficult to create any clear chances throughout the match.\n",
            "\n",
            "If you want more match results or details, just let me know! 🎉⚽\n",
            "==================================================\n",
            "\n",
            "==================================================\n",
            "QUERY: Player injuries news\n",
            "==================================================\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:openai.agents:OPENAI_API_KEY is not set, skipping trace export\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Found 5 relevant items:\n",
            "1. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0083)\n",
            "2. News: Liverpool suffer injury blow as Salah ruled out for three weeks (Score: 0.0083)\n",
            "3. Match: ARS vs MNC (1-1) (Score: 0.0082)\n",
            "4. Team: Inter Milan (Score: 0.0082)\n",
            "5. Team: Manchester United (Score: 0.0081)\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:openai.agents:OPENAI_API_KEY is not set, skipping trace export\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Here's some fresh injury news from the world of sports:\n",
            "\n",
            "### Liverpool:\n",
            "\n",
            "- **Mohamed Salah** is facing a setback! 😢 The star forward has been ruled out for three weeks due to a hamstring strain. He sustained the injury during Liverpool's recent match against Manchester United. This comes at a bad time as Liverpool prepares to face off against Arsenal and Manchester City. Manager Jürgen Klopp described the situation as \"unfortunate timing.\" \n",
            "\n",
            "Stay tuned for more updates! ⚽🔍\n",
            "==================================================\n",
            "\n",
            "==================================================\n",
            "QUERY: Bayern Munich performance\n",
            "==================================================\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:openai.agents:OPENAI_API_KEY is not set, skipping trace export\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Found 5 relevant items:\n",
            "1. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.0083)\n",
            "2. Team: Bayern Munich (Score: 0.0083)\n",
            "3. Team: Bayern Munich (Score: 0.0082)\n",
            "4. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.0082)\n",
            "5. Match: BAY vs BVB (4-0) (Score: 0.0081)\n",
            "Bayern Munich is on fire! 🎉\n",
            "\n",
            "1. **Harry Kane's Hat-Trick Magic**: Harry Kane recently scored a *perfect hat-trick* (right foot, left foot, and header) as Bayern Munich crushed Borussia Dortmund 4-0 in Der Klassiker. Kane, who joined from Tottenham, is thriving in the Bundesliga, saying he's loving his time in Munich and the fantastic football they're playing!\n",
            "\n",
            "2. **Match Details**: In that same match, apart from Kane's brilliant performance, Leroy Sané also got on the scoresheet, leading Bayern to a dominant victory at the Allianz Arena.\n",
            "\n",
            "Bayern Munich is clearly playing some dazzling football right now! ⚽🥳\n",
            "==================================================\n"
          ]
        }
      ],
      "source": [
        "from agents import Agent, Runner\n",
        "\n",
        "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY\n",
        "virtual_primary_care_assistant = Agent(\n",
        "    name=\"Sports Assistant specialised on sports queries\",\n",
        "    model=OPENAI_MODEL,\n",
        "    instructions=\"\"\"\n",
        "      You can search information using the tools hybrid_search, be excited like you are a fun!\n",
        "    \"\"\",\n",
        "    tools=[hybrid_search],\n",
        ")\n",
        "\n",
        "example_queries = [\n",
        "    \"Recent Manchester United games\",\n",
        "    \"The Red Devils, how did they do?\",\n",
        "    \"Who won El Clasico?\",\n",
        "    \"Premier League match results\",\n",
        "    \"Player injuries news\",\n",
        "    \"Bayern Munich performance\",\n",
        "]\n",
        "\n",
        "# run_result_with_tools = await Runner.run(virtual_primary_care_assistant, input = \"Who won El claisco you know?\")\n",
        "\n",
        "print(\"Testing agentic hybrid search with example queries:\")\n",
        "print(\"=\" * 50)\n",
        "\n",
        "for query in example_queries:\n",
        "    print(\"\\n\" + \"=\" * 50)\n",
        "    print(f\"QUERY: {query}\")\n",
        "    print(\"=\" * 50)\n",
        "    run_result_with_tools = await Runner.run(\n",
        "        virtual_primary_care_assistant, input=query\n",
        "    )\n",
        "    print(run_result_with_tools.final_output)\n",
        "    print(\"=\" * 50)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.6"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "state": {}
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
